Smoothed Bloom Filter Language Models: Tera-Scale LMs on the Cheap

نویسندگان

  • David Talbot
  • Miles Osborne
چکیده

A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements fall significantly below lossless information-theoretic lower bounds but it produces false positives with some quantifiable probability. Here we present a general framework for deriving smoothed language model probabilities from BFs. We investigate how a BF containing n-gram statistics can be used as a direct replacement for a conventional n-gram model. Recent work has demonstrated that corpus statistics can be stored efficiently within a BF, here we consider how smoothed language model probabilities can be derived efficiently from this randomised representation. Our proposal takes advantage of the one-sided error guarantees of the BF and simple inequalities that hold between related n-gram statistics in order to further reduce the BF storage requirements and the error rate of the derived probabilities. We use these models as replacements for a conventional language model in machine translation experiments.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Cuckoo Filter Modification Inspired by Bloom Filter

Probabilistic data structures are so popular in membership queries, network applications, and so on. Bloom Filter and Cuckoo Filter are two popular space efficient models that incorporate in set membership checking part of many important protocols. They are compact representation of data that use hash functions to randomize a set of items. Being able to store more elements while keeping a reaso...

متن کامل

Morpheme level hierarchical pitman-yor class-based language models for LVCSR of morphologically rich languages

Performing large vocabulary continuous speech recognition (LVCSR) for morphologically rich languages is considered a challenging task. The morphological richness of such languages leads to high out-of-vocabulary (OOV) rates and poor language model (LM) probabilities. In this case, the use of morphemes has been shown to increase the lexical coverage and lower the LM perplexity. Another approach ...

متن کامل

Integrating High and Low Smoothed LMs in a CSR System

1 In Continuous Speech Recognition (CSR) systems, acoustic and Language Models (LM) must be integrated. To get optimum CSR performances, it is well-known that heuristic factors must be optimised. Due to its great effect on final CSR performances, the exponential scaling factor applied to LM probabilities is the most important. LM probabilities are obtained after applying a smoothing technique. ...

متن کامل

Routing on large scale mobile ad hoc networks using bloom filters

A bloom filter is a probabilistic data structure used to test whether an element is a member of a set. The bloom filter shares some similarities to a standard hash table but has a higher storage efficiency. As a drawback, bloom filters allow the existence of false positives. These properties make bloom filters a suitable candidate for storing topological information in large-scale mobile ad hoc...

متن کامل

Gradient Compared Lp-LMS Algorithms for Sparse System Identification

In this paper, we propose two novel p-norm penalty least mean square (lp-LMS) algorithms as supplements of the conventional lp-LMS algorithm established for sparse adaptive filtering recently. A gradient comparator is employed to selectively apply the zero attractor of p-norm constraint for only those taps that have the same polarity as that of the gradient of the squared instantaneous error, w...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007